Because this work takes the form of a mini-thesis, where even a separate abstract was requested, I decided to also include some acknowledgements. Thus, I want to dedicate this work to the course lecturer and to my cousin Kimmo, who has somehow tolerated my seemingly endless frustration with statistics and early mornings since last autumn. Thank you for an interesting course!
In this work, I’ve investigated how well the regional differences in the share of the population with at least secondary education can be explained by other socioeconomic factors. The research area was the City of Helsinki; the data had been collected by the city and is from 2015. I performed a Principal Component Analysis (PCA) on the data to reduce its dimensionality and, based on manual investigation of the data and the results of the PCA, created a Linear Model (LM) that aims to explain the secondary or higher education ratio.
The results show quite clearly, and rather surprisingly, that the factor most clearly explaining the secondary education ratio is the rate of support for the Finns party in the national elections, even though the unemployment rate, dependency on income support and population density also play a part; the Finns party support rate alone explains over 80% of the regional variability of the secondary or higher education rate.
My purpose is to investigate how well the regional differences in the share of the population with at least secondary education are explainable with other socioeconomic factors. To do that, I have utilised Helsinki Region Infoshare and downloaded a dataset that includes various district-level statistics of the City of Helsinki. The data is maintained by the [City of Helsinki Urban Facts Centre](http://www.hel.fi/www/tieke/en) and was downloaded on 04/03/17. It is shared under the Creative Commons Attribution 4.0 International licence. The data is further explained by the Urban Facts Centre in this document, but the document is mostly in Finnish.
While I assume that the target variable – the share of the population with at least secondary education – is explainable by the other factors, the dataset doesn’t directly include this factor, and it includes a very large number of other factors, most of which are most likely not relevant for explaining the target variable. Thus, I’ve created a separate R script to tidy up the data, create some new derived variables and join its distinct parts together. The source code of that script is viewable at the following address: https://github.com/pinjaliina/IODS-final/blob/master/create_helsinki.R.
The script creates a more usable dataset that can be read in as follows:
helsinki <- as.data.frame(read.table('helsinki.csv', sep = '\t', header = TRUE))
The structure and dimensions of the data can be shown as follows:
glimpse(helsinki)
## Observations: 33
## Variables: 21
## $ fin_spk <dbl> 83.7, 75.4, 81.0, 83.3, 79.0, 82.2, 81.6, 82....
## $ swe_spk <dbl> 9.4, 16.6, 9.0, 9.8, 14.0, 7.3, 11.5, 6.4, 4....
## $ oth_spk <dbl> 6.9, 7.9, 10.0, 6.8, 7.0, 10.5, 6.9, 11.0, 13...
## $ foreign <dbl> 5.0, 6.1, 7.2, 5.4, 5.2, 6.7, 5.3, 8.4, 9.3, ...
## $ foreign_bkg <dbl> 8.2, 9.7, 11.1, 8.2, 8.1, 10.6, 8.4, 11.6, 13...
## $ hki_born <dbl> 39.0, 39.6, 35.6, 35.9, 40.6, 39.0, 43.1, 36....
## $ pop_dens <int> 6014, 4028, 8090, 7674, 5809, 3833, 3790, 498...
## $ emp_dens <int> 17362, 3851, 11557, 5827, 1796, 5508, 1288, 2...
## $ inc_cap <int> 40688, 46112, 42065, 36416, 37796, 28450, 377...
## $ sq_m_cap <dbl> 38.9, 37.6, 36.4, 35.6, 35.2, 31.7, 37.6, 31....
## $ health_cap <dbl> 9.4, 7.6, 8.8, 11.1, 10.6, 10.3, 13.8, 13.3, ...
## $ inc_supp <dbl> 4.0, 4.8, 5.8, 5.2, 3.9, 9.3, 4.8, 10.3, 12.2...
## $ child_prot <dbl> 5.6, 4.4, 6.2, 3.6, 3.2, 8.5, 4.7, 8.3, 11.5,...
## $ unemp_rate <dbl> 7.4, 7.8, 8.3, 7.0, 7.9, 10.3, 7.5, 10.6, 11....
## $ emp_rate <dbl> 76.7, 76.3, 75.4, 77.7, 78.3, 73.4, 74.3, 71....
## $ carless <dbl> 54.4, 56.4, 60.5, 60.1, 48.1, 61.2, 50.3, 58....
## $ edu_sec <dbl> 74.0, 75.2, 70.5, 77.9, 70.9, 68.8, 69.8, 66....
## $ dw_rent <dbl> 44.2, 40.4, 58.4, 46.5, 36.0, 75.1, 49.1, 68....
## $ pop_growth <dbl> 0.9, 0.7, 1.8, 0.4, 1.6, 1.3, 0.4, 0.4, 1.8, ...
## $ VIHR <dbl> 21.9, 20.0, 21.6, 22.6, 18.2, 22.9, 17.1, 20....
## $ PS <dbl> 5.4, 3.9, 5.7, 5.0, 6.2, 7.7, 6.6, 10.1, 13.6...
As seen above, the data includes 33 observations of 21 variables. If we take a look at the names of the individual observations, we see that they match the names of the city districts:
rownames(helsinki)
## [1] "101 Vironniemen peruspiiri" "102 Ullanlinnan peruspiiri"
## [3] "103 Kampinmalmin peruspiiri" "104 Töölön peruspiiri"
## [5] "105 Lauttasaaren peruspiiri" "201 Reijolan peruspiiri"
## [7] "202 Munkkiniemen peruspiiri" "203 Haagan peruspiiri"
## [9] "204 Pitäjänmäen peruspiiri" "205 Kaarelan peruspiiri"
## [11] "301 Kallion peruspiiri" "302 Alppiharjun peruspiiri"
## [13] "303 Vallilan peruspiiri" "304 Pasilan peruspiiri"
## [15] "305 Vanhankaupungin peruspiiri" "401 Maunulan peruspiiri"
## [17] "402 Pakilan peruspiiri" "403 Tuomarinkylän peruspiiri"
## [19] "404 Oulunkylän peruspiiri" "405 Pakilan peruspiiri"
## [21] "501 Latokartanon peruspiiri" "502 Pukinmäen peruspiiri"
## [23] "503 Malmin peruspiiri" "504 Suutarilan peruspiiri"
## [25] "505 Puistolan peruspiiri" "506 Jakomäen peruspiiri"
## [27] "601 Kulosaaren peruspiiri" "602 Herttoniemen peruspiiri"
## [29] "603 Laajasalon peruspiiri" "701 Vartiokylän peruspiiri"
## [31] "702 Myllypuron peruspiiri" "703 Mellunkylän peruspiiri"
## [33] "704 Vuosaaren peruspiiri"
We can also see that district number 801, the Östersundom district (Östersundomin peruspiiri), is missing. This is because the area of that district was joined to the city only in 2009, and no population data was available before that year.
The shortened variable names in the data can be explained in more detail as follows:
A note on the political parties: the Greens (VIHR) is a value-liberal, slightly left-leaning pro-urbanisation party. The Finns (PS) is a populist, value-conservative, racism-flirting right-wing party. To avoid including all of the parties, I chose two that are relative polar opposites (even if not on the traditional left–right scale).
A graphical overview of the data and summaries of its variables can be displayed as follows:
ggpairs(helsinki, mapping = aes(), lower = list(combo = wrap("facethist", bins = 10)), upper = list(continuous = wrap("cor", size=3)))
summary(helsinki)
## fin_spk swe_spk oth_spk foreign
## Min. :70.50 Min. : 1.500 Min. : 3.6 Min. : 2.000
## 1st Qu.:79.00 1st Qu.: 3.200 1st Qu.: 7.9 1st Qu.: 5.700
## Median :82.30 Median : 4.000 Median :10.0 Median : 6.800
## Mean :82.41 Mean : 5.788 Mean :11.8 Mean : 7.742
## 3rd Qu.:86.70 3rd Qu.: 6.600 3rd Qu.:16.2 3rd Qu.: 9.500
## Max. :93.50 Max. :19.200 Max. :28.0 Max. :18.600
## foreign_bkg hki_born pop_dens emp_dens
## Min. : 3.90 Min. :27.60 Min. : 960 Min. : 85
## 1st Qu.: 8.80 1st Qu.:39.00 1st Qu.: 2544 1st Qu.: 529
## Median :10.60 Median :42.60 Median : 3194 Median : 1339
## Mean :11.82 Mean :42.24 Mean : 4014 Mean : 2881
## 3rd Qu.:14.60 3rd Qu.:46.50 3rd Qu.: 4506 3rd Qu.: 3692
## Max. :25.90 Max. :55.20 Max. :12887 Max. :17362
## inc_cap sq_m_cap health_cap inc_supp
## Min. :17077 Min. :28.40 Min. : 7.60 Min. : 1.80
## 1st Qu.:24386 1st Qu.:32.50 1st Qu.:10.50 1st Qu.: 5.50
## Median :26501 Median :33.80 Median :11.10 Median :10.10
## Mean :29433 Mean :34.41 Mean :11.28 Mean :10.39
## 3rd Qu.:34599 3rd Qu.:36.40 3rd Qu.:12.30 3rd Qu.:13.80
## Max. :46962 Max. :42.90 Max. :17.60 Max. :27.40
## child_prot unemp_rate emp_rate carless
## Min. : 2.100 Min. : 6.60 Min. :57.80 Min. :17.90
## 1st Qu.: 6.000 1st Qu.: 8.30 1st Qu.:67.40 1st Qu.:47.00
## Median : 8.500 Median :10.70 Median :71.30 Median :51.00
## Mean : 8.803 Mean :11.18 Mean :70.47 Mean :50.48
## 3rd Qu.:10.500 3rd Qu.:12.70 3rd Qu.:74.00 3rd Qu.:58.00
## Max. :19.600 Max. :18.50 Max. :78.30 Max. :75.50
## edu_sec dw_rent pop_growth VIHR
## Min. :47.00 Min. : 8.90 Min. :-0.3000 Min. : 9.90
## 1st Qu.:56.30 1st Qu.: 49.10 1st Qu.: 0.3000 1st Qu.:14.10
## Median :61.90 Median : 67.30 Median : 0.4000 Median :17.40
## Mean :62.72 Mean : 64.13 Mean : 0.9242 Mean :17.96
## 3rd Qu.:68.80 3rd Qu.: 82.40 3rd Qu.: 1.3000 3rd Qu.:21.30
## Max. :77.90 Max. :119.70 Max. : 4.1000 Max. :28.60
## PS
## Min. : 3.90
## 1st Qu.: 7.70
## Median :11.40
## Mean :11.92
## 3rd Qu.:16.10
## Max. :27.30
As seen from the plot, most of the variables are quite skewed and not very close to a normal distribution, though there are some exceptions; statistically it is quite unsurprising that most of the exceptions involve variables that generally cover a large share of the total population, such as the ratio of Finnish speakers and the ratio of Helsinki natives. The summary further reveals that many of the variables have significant outliers – i.e. min or max values very far away from their interquartile range (e.g. the Swedish speakers ratio, foreign population ratio, income support ratio and carless ratio). This somewhat hints toward potential segregation of the districts, though some of the extreme values, like those of population density, are probably better explained by other factors, and it would in any case be rather dangerous to jump to conclusions without further analysis.
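As a complement to eyeballing the summaries, the box-plot rule of thumb used above can also be scripted. The helper below is a hypothetical addition (not part of the original analysis), demonstrated on a small made-up data frame: it flags variables whose minimum or maximum lies more than 1.5 × IQR beyond the quartiles.

```r
# Rule-of-thumb outlier flag: does min or max lie beyond 1.5 * IQR from the quartiles?
has_outliers <- function(df) {
  sapply(df, function(v) {
    q <- quantile(v, c(0.25, 0.75))
    step <- 1.5 * IQR(v)
    min(v) < q[1] - step || max(v) > q[2] + step
  })
}

d <- data.frame(even = 1:10, spiky = c(rep(5, 9), 100))
has_outliers(d)   # even: FALSE, spiky: TRUE
```

With the actual data, `has_outliers(helsinki)` would give a quick per-variable overview to cross-check against the summary table.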
The above overview plot already shows that there are not just clear linear dependencies but also some very significant correlations between the variables, even if some of them are quite self-explanatory, such as the correlation between the foreign-language speakers ratio and the foreigners ratio, which is very close to one. However, due to the large number of variables it is not easy to spot all the highest correlations in the above plot, so it is useful to plot the correlations separately:
cor_matrix <- round(cor(helsinki), digits = 2)
corrplot(cor_matrix, method = "circle", type = "upper", tl.pos = "d", tl.cex = 0.8)
We can now see more clearly that, besides the very obvious ones, there are some interesting, very high correlations, such as the foreign-background-related variables (foreigner rate, foreign background rate and other speakers rate) correlating highly with the low-income-related variables (income per capita, income support rate and unemployment rate). Perhaps socially alarming, a high child protection rate also seems to relate to a foreign background. However, it is interesting to note that the same low-income-related variables and a high child protection rate are also related to high support for the Finns party. My chosen target variable, the secondary or higher education ratio, also has a relatively high negative correlation with the Finns party support, but on the other hand a high positive correlation with the Green party support. In addition, it correlates highly with the employment rate.
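To avoid picking the strongest correlations by eye from the plot, the pairs can also be ranked programmatically. The helper below is a hypothetical sketch (not part of the original analysis), demonstrated on synthetic data rather than `helsinki`:

```r
# Hypothetical helper: list variable pairs by absolute correlation, strongest first
top_cors <- function(df, n = 5) {
  cm <- cor(df)
  cm[lower.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
  idx <- order(abs(cm), decreasing = TRUE, na.last = NA)
  head(data.frame(var1 = rownames(cm)[row(cm)[idx]],
                  var2 = colnames(cm)[col(cm)[idx]],
                  cor  = cm[idx]), n)
}

set.seed(3)
x <- rnorm(50)
d <- data.frame(x = x, y = x + rnorm(50, sd = 0.05), z = rnorm(50))
top_cors(d, 1)   # the nearly identical x and y pair comes out on top
```

Called as `top_cors(helsinki, 10)`, it would list the ten most correlated variable pairs of the actual data as a table.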
While creating the dataset, I intentionally included quite a large part of the variables in the source data, as I was not really sure which of them would actually best explain the target variable. However, as the above exploration has made clear, there are clearly too many variables explaining roughly the same dimensions of the data, and thus too much dimensionality. To reduce it, I’m going to utilise a dimensionality reduction method called Principal Component Analysis (PCA). The purpose is to extract the most essential dimensions of the data by transforming it, and to use these principal components for further analysis, in the hope that the components explaining most of the total variance characterise the whole dataset reasonably accurately. I’m going to perform the PCA using Singular Value Decomposition (SVD) as the matrix decomposition method.
To perform the PCA, the data needs to be standardised, i.e. every variable scaled so that its mean is zero and its standard deviation is one. This is because PCA is sensitive to the relative scaling of the original features and treats features with larger variance as more important than features with smaller variance. Without standardisation, variables measured on large scales (such as population density) would dominate the principal components regardless of their actual explanatory value, so for our dataset the standardisation is essential.
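What `scale()` does can be verified on a toy vector: each value is shifted by the mean and divided by the standard deviation, so the result has mean zero and standard deviation one.

```r
x <- c(2, 4, 6, 8)
z <- as.vector(scale(x))   # equivalent to (x - mean(x)) / sd(x)
mean(z)                    # 0
sd(z)                      # 1
```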
The standardisation as well as the actual PCA and summarising and plotting the results can be executed as follows:
helsinki_std <- as.data.frame(scale(helsinki)) # Standardise.
pca_hki_std <- prcomp(helsinki_std) # Perform the PCA.
s_pca_hki_std <- summary(pca_hki_std) # Summarise the results.
s_pca_hki_std # Print summary.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 3.2657 2.1913 1.43035 1.0112 0.9459 0.74790
## Proportion of Variance 0.5078 0.2286 0.09742 0.0487 0.0426 0.02664
## Cumulative Proportion 0.5078 0.7365 0.83391 0.8826 0.9252 0.95185
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.54105 0.45308 0.40936 0.33064 0.2550 0.2199
## Proportion of Variance 0.01394 0.00978 0.00798 0.00521 0.0031 0.0023
## Cumulative Proportion 0.96579 0.97556 0.98354 0.98875 0.9918 0.9941
## PC13 PC14 PC15 PC16 PC17
## Standard deviation 0.20820 0.18538 0.13116 0.11032 0.08618
## Proportion of Variance 0.00206 0.00164 0.00082 0.00058 0.00035
## Cumulative Proportion 0.99621 0.99785 0.99867 0.99925 0.99960
## PC18 PC19 PC20 PC21
## Standard deviation 0.07200 0.05350 0.01855 0.003193
## Proportion of Variance 0.00025 0.00014 0.00002 0.000000
## Cumulative Proportion 0.99985 0.99998 1.00000 1.000000
# Extract percentages of variance from the summary (for the plot titles).
pr_shs <- round(100*s_pca_hki_std$importance[2, ], digits = 1)
pc_shs_lab <- paste0(names(pr_shs), " (", pr_shs, "%)") # Create plot axis titles.
# Plot.
biplot(pca_hki_std, choices = 1:2, cex = c(0.7, 0.8), col = c("grey40", "deeppink2"), xlab = pc_shs_lab[1], ylab = pc_shs_lab[2])
While the PCA produces as many principal components as there are variables in the original data, we can see from the summary output that the first two of them alone explain over 70% of the total variance in the data. Because the angles of the arrows in the plot indicate the correlations between the original features and the principal components, and their lengths indicate how strongly each feature contributes to the variance, it is now easier to say reliably which variables affect which components, and which are more important than others. From the plot, we can conclude that the variables correlating strongly with the first principal component are mostly related to wellbeing and background, and the variables correlating strongly with the second principal component are mostly related to urbanisation. The variables that correlate most strongly with both of the first two principal components are living space per capita and the secondary or higher education ratio, our designated target variable.
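The proportions of variance shown in the summary are nothing mysterious: each is simply the squared standard deviation of a component divided by the total of all squared standard deviations. This can be checked on synthetic data:

```r
set.seed(1)
m <- matrix(rnorm(200), nrow = 20)     # 20 observations, 10 variables
p <- prcomp(scale(m))                  # standardise, then PCA
prop <- p$sdev^2 / sum(p$sdev^2)       # proportion of variance per component
# Matches the "Proportion of Variance" row (summary rounds to 5 digits):
max(abs(prop - summary(p)$importance[2, ]))
```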
Armed with this information, the data can now be analysed further.
For an actual model explaining the chosen target variable, secondary or higher education ratio, a linear model (LM) is the most suitable. The results of the PCA can be utilised to choose those variables for the explanatory variables that explain most of the total variability in the data and that correlate most strongly with the target variable.
Linear models themselves are simply statistical models that try to model the relationship between a target variable – in this case, the ratio of secondary or higher education – and one or more explanatory variables. Their most important assumption is linearity, but there are other assumptions as well (more on which below).
Initially, based on both the initial exploration of the variables and the results of the PCA, I’ve chosen the following explanatory variables for the model, in order of the strength of their correlation with the target variable (strongest correlations first): PS, emp_rate, VIHR, unemp_rate, inc_cap, oth_spk, inc_supp, pop_dens and emp_dens.
Fitting a model with the abovementioned explanatory variables and printing its summary can be done as follows:
helsinki_lm <- lm(edu_sec ~ PS + emp_rate + VIHR + unemp_rate + inc_cap + oth_spk + inc_supp + pop_dens + emp_dens, data = helsinki)
lm_summary <- summary(helsinki_lm)
lm_summary
##
## Call:
## lm(formula = edu_sec ~ PS + emp_rate + VIHR + unemp_rate + inc_cap +
## oth_spk + inc_supp + pop_dens + emp_dens, data = helsinki)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3557 -0.7598 0.0275 0.9680 3.8362
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.354e+01 2.829e+01 2.953 0.007127 **
## PS -1.437e+00 3.548e-01 -4.051 0.000495 ***
## emp_rate 1.126e-01 2.789e-01 0.404 0.690206
## VIHR -2.924e-01 2.327e-01 -1.257 0.221473
## unemp_rate -1.720e+00 4.302e-01 -3.998 0.000566 ***
## inc_cap -9.855e-05 1.349e-04 -0.731 0.472372
## oth_spk -2.929e-01 1.710e-01 -1.714 0.100062
## inc_supp 1.382e+00 3.363e-01 4.111 0.000427 ***
## pop_dens 1.096e-03 2.207e-04 4.967 5.06e-05 ***
## emp_dens 1.555e-04 1.213e-04 1.282 0.212450
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.802 on 23 degrees of freedom
## Multiple R-squared: 0.9661, Adjusted R-squared: 0.9529
## F-statistic: 72.9 on 9 and 23 DF, p-value: 9.106e-15
However, the results clearly indicate that most of the explanatory variables in the model are not statistically significant at all, as shown by their relatively high p-values, i.e. the probabilities of observing coefficient estimates at least this extreme if the true coefficients were zero. Thus, it is better to refit the model without them:
helsinki_lm <- lm(edu_sec ~ PS + unemp_rate + inc_supp + pop_dens, data = helsinki)
lm_summary <- summary(helsinki_lm)
lm_summary
##
## Call:
## lm(formula = edu_sec ~ PS + unemp_rate + inc_supp + pop_dens,
## data = helsinki)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7752 -0.5620 0.0008 0.6435 5.0575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.4483279 2.4088090 33.813 < 2e-16 ***
## PS -1.0864686 0.1281960 -8.475 3.25e-09 ***
## unemp_rate -1.7485666 0.3672234 -4.762 5.32e-05 ***
## inc_supp 0.8729470 0.2320192 3.762 0.000792 ***
## pop_dens 0.0011705 0.0001464 7.996 1.04e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.823 on 28 degrees of freedom
## Multiple R-squared: 0.9578, Adjusted R-squared: 0.9518
## F-statistic: 158.9 on 4 and 28 DF, p-value: < 2.2e-16
We can now see that all the regression coefficient estimates are now statistically highly significant, even though the estimated coefficient of population density is so small that its effect on the model is actually very tiny. The adjusted R2 also remains very high at 0.952: it means that about 95.2% of the variability of the target variable can be explained by the model.
Note that it is important to understand that a high R2 alone tells nothing about the usefulness of the model; in fact, adding any new variable to the model increases the R2 – and in many cases the adjusted R2 as well, even though in principle the latter is supposed to penalise the inclusion of unnecessary variables!
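This can be demonstrated with pure noise: below, a completely irrelevant random variable is added to a toy model (variable names are made up), and the multiple R2 still does not decrease.

```r
set.seed(42)
d <- data.frame(y = rnorm(30), x = rnorm(30), noise = rnorm(30))

r2_base  <- summary(lm(y ~ x, data = d))$r.squared
r2_noise <- summary(lm(y ~ x + noise, data = d))$r.squared

r2_noise >= r2_base   # TRUE: R2 never decreases when a variable is added
```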
The individual explanatory variables of the model can be visualised as follows:
par(mfrow = c(2,2)) # Set some graphical params.
visreg(helsinki_lm)
In addition to linearity, linear models are usually fitted with the assumption that the errors are normally distributed, that their variance is constant, and that they are not correlated with each other.
To investigate the validity of the model, it is possible to draw specific diagnostic plots about the model residuals that can be used to observe if there are any deviations from the abovementioned assumptions.
plot(helsinki_lm, which = c(2))
The Q–Q plot can be used to investigate whether the model errors are normally distributed. In our case, this is somewhat questionable; while most of the residuals do seem to follow a normal distribution, there are some clear outliers. Because there aren’t very many of them, they don’t necessarily invalidate the whole model, but they do suggest that the model is not quite as perfect as the high R2 suggests.
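If a numeric check is preferred over reading the Q–Q plot, the Shapiro–Wilk test from base R can be applied to the residuals. The sketch below uses simulated data rather than the actual model, so the errors are normal by construction:

```r
set.seed(7)
x <- rnorm(30)
y <- 2 * x + rnorm(30)          # errors truly normal by construction
fit <- lm(y ~ x)
# A small p-value would be evidence against normally distributed errors:
shapiro.test(residuals(fit))
```

For the actual model, the equivalent call would be `shapiro.test(residuals(helsinki_lm))`.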
plot(helsinki_lm, which = c(1))
The residuals vs. fitted plot shows whether the size of the errors depends on the explanatory variables (meaning that their variance σ2 would not be constant) and whether the errors are correlated; because no pattern can be observed in the plot, we can safely assume that neither is the case.
plot(helsinki_lm, which = c(5))
Even if the model matches the abovementioned assumptions, it is also recommended to check that no single observation has an outsized effect on the model, because such an observation might distort the model coefficients. This can be done with the residuals vs. leverage plot depicted above. While some observations seem to have a slightly outsized influence on the model, their residuals remain very close to zero. Also, because the x-axis scale of the plot remains relatively narrow, we can conclude that the model is not severely distorted by any single observation.
While the model validation does indicate that the model is not as perfect as we would ideally like, it still demonstrates that it is usable and explains the phenomenon in question – the ratio of secondary or higher education – reasonably well. Personally, I was quite shocked and totally unprepared to find out that the support of the Finns party alone explains such a high share of the variability of the target variable; even if the model is refitted using the Finns party support as the only explanatory variable, it still explains 80.5% of the variability of the secondary or higher education ratio. None of the other explanatory variables alone explains such a high share of the target variability: the unemployment rate, income support ratio and population density have R2 values of 0.507, 0.462 and 0.444, respectively, when fitted as the only explanatory variable. (All of the mentioned single-explanatory-variable models remain statistically highly significant.)
What somewhat shadows the model, though, is that even though the correlation between the target variable and the income support explanatory variable is negative at -0.68, the estimate of its coefficient in the model is positive at 0.87. Not only does this feel illogical, but if the model is refitted with income support as the only explanatory variable, the coefficient is negative at -1.03. Such a sign flip is a common symptom of multicollinearity: income support correlates strongly with other explanatory variables such as the unemployment rate, so its coefficient in the full model describes only its partial effect with the other variables held constant. This situation would most likely warrant further investigation, but I feel that both the time constraints of this course and the limits of my skills put that out of the scope of this work. I did, however, briefly explore alternative models that did not include all the explanatory variables of the final model, but their validation results suggested that they violated some critical assumptions of linear models more obviously than the final model, and that making predictions with them would thus be even more problematic, so I decided to stick with the final model.
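The sign flip described above can be reproduced with simulated collinear data: when one predictor largely duplicates another, its marginal and partial coefficients can have opposite signs. The variable names below are made up and the data is synthetic, not the Helsinki data:

```r
set.seed(2017)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.3)            # x2 is almost a copy of x1
y  <- -2 * x1 + x2 + rnorm(200, sd = 0.5)  # true partial effect of x2 is +1

coef(lm(y ~ x2))["x2"]        # negative: x2 inherits x1's strong negative effect
coef(lm(y ~ x1 + x2))["x2"]   # positive: the partial effect with x1 held constant
```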
Also, there are some other phenomena potentially explained by the data, such as the reasons behind the high ratio of child protection, that might warrant further research.